200824 Expt.299: RNA Interactome Capture (RIC)

This is experiment replicates the original RIC protocol from Castello et al. (2012) with the following changes:

  • Cell number reduced to 10e6
  • Single shot MS
  • LFQ quantitation on pools of n=6 nCL/cCL conditions The intention of this experiment is to provide a baseline against which other single-shot/LFQ methods can be developed.

1. Import Modules and Files

Custom Functions
jwrangle.importMixedFiles( )

I generally import everything I MIGHT use at the start and set up pathing using the OS-agnostic pathlib.

In [1]:
#### File utilities
import os
import pandas as pd
from pathlib import Path
from imp import reload

#### Data Wrangling
import copy
import numpy as np

#### RBP Suite Modules
import jwrangle
import jvis
import jinspect
import jtest
import jweb

#### Sequence Tools
from Bio import SeqIO

#### Graphical Packages
import upsetplot as upset
import seaborn as sns
import matplotlib.pyplot as plt
import altair as alt

#### define working directories
cwd  = Path(os.getcwd())
base_path = Path(os.path.join(*cwd.parts[:cwd.parts.index('experiments')]))

#### MaxQuant proteinGroups & evidence files
MQ_folder = jwrangle.importMixedFiles(cwd / 'MaxQuant')
MQ_folder.keys()

pGroups = MQ_folder['proteinGroups.txt']
evidence = MQ_folder['evidence.txt']

#### Inspect MQ setup
MQ_folder['parameters.txt'].head(9)
C:\Users\smith.j\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3242: DtypeWarning: Columns (55) have mixed types.Specify dtype option on import or set low_memory=False.
  if (await self.run_code(code, result,  async_=asy)):
Out[1]:
Parameter Value
0 Version 1.6.7.0
1 User name smith.j
2 Machine name MSCYPHER-253
3 Date of writing 08/24/2020 19:24:56
4 Include contaminants True
5 PSM FDR 0.01
6 PSM FDR Crosslink 0.01
7 Protein FDR 0.01
8 Site FDR 0.01

2. Metadata Creation

Custom Functions
jwrangle.MQ_writeMetadata( )

Metadata tabulates the test conditions for ALL experiments that shared the same MQ search and thuss all experiments that comprise the MQ outputs. Metadata can also be done in a spreadsheet program. Here, I have created my metadata programmatically instead (I simply find this easier).
The metadata table gives users the opportunity to rename samples and define the experimental parameters for the data. This task can be expecially complex for MaxQuant because a unified output is generated even if distinctly separate experiments are searched as a batch and with different parameters applied.
The function jwrangle.MQ_writeMetadata( ) will take a metadata table, rename all samples in the proteinGroups and evidence files, assign alternative filenames, and save new copies to be used in future analyses.

In [4]:
#### Inspect column names
colnames = list(pGroups.columns.values)

#### View experiment names as a list
experiment_names = []
for i in colnames:
    if 'Intensity ' in i:
        experiment_names.append(i.replace('Intensity ', ''))

#### I want to assign new names....
new_expt_names = ['nCL_OdT_A', 'nCL_OdT_B', 'nCL_OdT_C', 'nCL_OdT_D', 'nCL_OdT_E', 'nCL_OdT_F', 
                  '254_OdT_A', '254_OdT_B', '254_OdT_C', '254_OdT_D', '254_OdT_E', '254_OdT_F']

#### Assign condition identifiers
condition =  []
for i in new_expt_names:
    condition.append(i[:3])

#### Create a list of associated replicate identifiers
replicate = []
for i in experiment_names:
    replicate.append(i[-1])

#### The MQ_groups dictionary assigns each condition to its MQ group parameter
MQ_groups = {'RIC-OdT':['nCL','254']}

#### We then use the condition list to create an MQ_groups list
Grp_parameters = []
for label in condition:
    for key, value in MQ_groups.items():
        if label in value:
            Grp_parameters.append(key)

exp_df = pd.DataFrame(
    {'experiment': experiment_names,
     'condition': condition,
     'replicate': replicate,
     'sample':new_expt_names,
     'measure':['Intensity']*len(new_expt_names),                  # adding this column allows our metadata file to be compatible with Proteus
     'MQgroups':Grp_parameters
    })

exp_df
Out[4]:
experiment condition replicate sample measure MQgroups
0 NC-A nCL A nCL_OdT_A Intensity RIC-OdT
1 NC-B nCL B nCL_OdT_B Intensity RIC-OdT
2 NC-C nCL C nCL_OdT_C Intensity RIC-OdT
3 NC-D nCL D nCL_OdT_D Intensity RIC-OdT
4 NC-E nCL E nCL_OdT_E Intensity RIC-OdT
5 NC-F nCL F nCL_OdT_F Intensity RIC-OdT
6 PC-A 254 A 254_OdT_A Intensity RIC-OdT
7 PC-B 254 B 254_OdT_B Intensity RIC-OdT
8 PC-C 254 C 254_OdT_C Intensity RIC-OdT
9 PC-D 254 D 254_OdT_D Intensity RIC-OdT
10 PC-E 254 E 254_OdT_E Intensity RIC-OdT
11 PC-F 254 F 254_OdT_F Intensity RIC-OdT
In [5]:
MQ_expt299 = jwrangle.MQ_writeMetadata(pGroups, evidence, exp_df, 'e299', cwd)
metadata.csv created in metadata folder
e299_proteinGroups_metalabeled.txt created in MaxQuant folder
e299_evidence_metalabeled.txt created in MaxQuant folder

3. Re-Annotate the MaxQuant pGroups with stable Gene IDs

Functions
jweb.mapAnyID( )
jwrangle.importMixedFiles( )

MaxQuant does a good job of assigning a Gene name to each protein group. Presumably these gene names come from the FASTA. However:

  • Sometimes it fails to find a gene name
  • Sometimes it will assign an ID that is not a gene or include out-of-place characters
  • It doesn't always seem to be consistent
  • If the gene name originates from the FASTA then repeating the MQ search with an updated FASTA is the only way to update the gene IDs.
  • Use of a mapping service will standardise the ID conversion practices between my datasets and those of others, including RNA-Seq.

To avoid these problems we will remap the Majority protein IDs to ENTREZ gene IDs. jweb.mapAnyID( ) will retrieve all possible genes for each protein group, and will also select a primary ID to singularly represent the group by a consistent method. This is a very flexible function, see help( ) for further explanation. From this point, the MQ 'Gene names' column will no longer be necessary. This function can also handle ID mapping to and from almost any convention.

Ensuring our proteins have a consistent gene naming strategy is essential for inter-experiment comparison and the later use of set methods. It also creates a standard that can be applied for accurately mapping RNA-Seq results and thus aid in future mapping of protein-RNA partners.

In [2]:
#### If not already loaded, read in the metadata-adjusted files
metadata = pd.read_csv(cwd / 'metadata' / 'e299_metadata.csv', index_col = 0)
pGroups = pd.read_csv(cwd / 'MaxQuant' / 'e299_proteinGroups_metalabeled.txt', delimiter = '\t')
evidence = pd.read_csv(cwd / 'MaxQuant' / 'e299_evidence_metalabeled.txt', delimiter = '\t')
C:\Users\smith.j\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3051: DtypeWarning: Columns (55) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [3]:
# #### Dynamically remap gene names in our proteinGroups file and save a copy
# pGroups_remap = jweb.mapAnyID_gPro(pGroups['Majority protein IDs'].tolist(), splitstr = [';', '-'], geneProductType = 'protein', 
#                               gConvertOrganism = 'hsapiens', gConvertTarget = 'ENTREZGENE', writetopath = [cwd, 'pGroups_remap'], writeTargetsAsList = 'NO')
In [4]:
#### If not already loaded, read in the remapped proteinGroups file
pGroups_remap = jwrangle.importMixedFiles(cwd / 'downloads' / 'pGroups_remap', dropSuffix = 'yes')
pGroups_remap.keys()
Out[4]:
dict_keys(['id_map', 'query_map'])
In [5]:
#### jwrangle.importMixedFiles() returns a dictionary where keys = files. We want the 'id_map' table created by jweb.mapAnyID_gPro().
#### We'll rename the Query column and drop duplicates so the table can be merged with our proteinGroups table.
id_map = pGroups_remap['id_map'].rename(columns={'Query': 'Majority protein IDs'}).drop_duplicates()
id_map.head(2)
Out[5]:
Majority protein IDs ENTREZGENE_gPro all ENTREZGENE_gPro primary ENTREZGENE_gPro name UNIPROT_gPro status
0 A0A087WZZ9;E5RIU6;P06493;A0A024QZP7;P06493-2;P... CDK1;CDK2;CDK3;None CDK1 cyclin dependent kinase 1 [Source:HGNC Symbol;... SWISSPROT
1 A0A0J9YXP0;A0A087X171;A0A087WXG7;A0A024R214;Q9... CPEB1;None CPEB1 cytoplasmic polyadenylation element binding pr... SWISSPROT
In [6]:
#### Now use merge to add these new columns to our proteinGroups table
pGroups_map = pd.merge(pGroups, id_map, on='Majority protein IDs', how='left')

#### Check the tables are merged by viewing column elements from each.
pGroups_map[id_map.columns.tolist() + ['Peptide IDs']].head(2)
Out[6]:
Majority protein IDs ENTREZGENE_gPro all ENTREZGENE_gPro primary ENTREZGENE_gPro name UNIPROT_gPro status Peptide IDs
0 A0A087WZZ9;E5RIU6;P06493;A0A024QZP7;P06493-2;P... CDK1;CDK2;CDK3;None CDK1 cyclin dependent kinase 1 [Source:HGNC Symbol;... SWISSPROT 6930;8293;8684;13624
1 A0A0J9YXP0;A0A087X171;A0A087WXG7;A0A024R214;Q9... CPEB1;None CPEB1 cytoplasmic polyadenylation element binding pr... SWISSPROT 1517;4259;4817;5575;8556;10478

4. Review Contaminants by Sample

Functions
jinspect.MQ_getContaminants( )
MQ_getContaminants_sbplot( )
jwrangle.importMixedFiles( )

We can extract the conaminants from our proteinGroups file using jinspect.MQ_getContaminants( ). These extracted table will return log2(iBAQ values).
Contaminants can then be reviewed with _MQ_getContaminantssbplot( ).

In [7]:
#### Extract contaminants
contaminants = jinspect.MQ_getContaminants(pGroups_map, metadata)
contaminants.head(2)
Out[7]:
nCL_OdT_A nCL_OdT_B nCL_OdT_C nCL_OdT_D nCL_OdT_E nCL_OdT_F 254_OdT_A 254_OdT_B 254_OdT_C 254_OdT_D 254_OdT_E 254_OdT_F
Protein ID: Gene
CON__P07477: PRSS3P2 22.222317 22.211337 29.728954 22.188338 22.071889 22.607723 24.213267 22.36263 24.079771 24.855668 23.871453 24.239361
CON__P02768-1: ALB 18.247343 17.307058 15.983907 15.569114 16.102201 21.350460 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000
In [8]:
#### Sort the metadata into a more intuitive order
metadata_sort = metadata.sort_values(by = ['experiment'])

#### Visually inspect contaminants  
jvis.MQ_getContaminants_sbplot(contaminants, metadata_sort, width = 1, length = 1.5, layout = 'individual')
<Figure size 432x288 with 0 Axes>

Results: AVERAGE

Wow thats a ton of rubbish in there!

5. Assess Digestion Efficiency

Functions
jinspect.MQ_getMissedCleavages( )
jvis.CommonPalettesAsHex
jvis.BarPlotByGroup_sbplot( )

Assessing missed cleavages is an essential metric for understanding the quality of the tryptic digestion. This data is recorded in the evidence file.
_jinspect.MQgetMissedCleavages( ) will return a long form data table that can easily be used for plotting.
The dictionary jvis.CommonPalettesAsHex contains a number of palettes that are common to both matplotlib and ggplot (from R). These are provided to ensure consistency is easy to achieve across both languages.
We'll plot the missed cleavages with the generic function jvis.BarPlotByGroup_sbplot( )

In [9]:
#### Extract the missed cleavage data into a long form table for plotting
MissedCleavages = jinspect.MQ_getMissedCleavages(evidence, metadata_sort, drop_contaminants = True)

### Select a colour palette  
cpal = jvis.CommonPalettesAsHex

set2_paired = []
for i in cpal['Set2_qual']:
    set2_paired.append(i)
    set2_paired.append(i)

#### Plot the grouped data points  
sns.set_style('whitegrid')
jvis.BarPlotByGroup_sbplot(MissedCleavages, x_col = 'group', y_col = '% Missed Cleavages', title = '% Missed Cleavages', pal = set2_paired)
<Figure size 432x288 with 0 Axes>

Results: Average

Digestion efficiency isn't great. Let's reduce the digestion volume by 0.5x and maintain the same trypsin concentration in future experiments

6. Remove Contaminants

Functions
jwrangle.MQ_getThreePassFilter( )
SeqIO.parse( )

After QC we no longer want the contaminants in our data. jwrangle.MQ_getThreePassFilter( ) will remove reverse peptides, contaminants, and only identified by site from MQ tables.
The filter will also accept customised exclusion lists in case users have added odd protein species to the search FASTA tables. In this particular experiment we added to the human FASTA, RNAse proteins and the large T antigen. The former as 1) a check that dynamic range is not being overwhelmed and 2) as an quantitative spike-in control to compare tryptic efficiency and the sample recovery across samples following C18 cleanup.

In [10]:
#### Map the location of the custom FASTA elements
os.listdir(base_path / 'my_resources' / 'FASTA')  

#### Create a list of the non-human proteins that were added to the custom FASTA genome search. 
new_cont = []
with open(base_path / 'my_resources' / 'FASTA' / "custom_proteome_elements.fasta", "r") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        new_cont.append(record.id.split('|')[1])

#### Remove all unwanted contaminants and IDs from the proteinGroups table      
pGroup_clean = jwrangle.MQ_getThreePassFilter(pGroups_map, custom_exclusion = new_cont)

#### Inspect the cleaned dataframe
pGroup_clean[['ENTREZGENE_gPro primary'] + [i for i in pGroup_clean.columns if 'iBAQ' in i]].head(2)
Out[10]:
ENTREZGENE_gPro primary iBAQ iBAQ nCL_OdT_A iBAQ nCL_OdT_B iBAQ nCL_OdT_C iBAQ nCL_OdT_D iBAQ nCL_OdT_E iBAQ nCL_OdT_F iBAQ 254_OdT_A iBAQ 254_OdT_B iBAQ 254_OdT_C iBAQ 254_OdT_D iBAQ 254_OdT_E iBAQ 254_OdT_F
0 CDK1 19718000.0 0.0 71756.0 465350.0 237490.0 191120.0 190850.0 1368400.0 2187600.0 2246700.0 3012900.0 3454800.0 6291400.0
1 CPEB1 9840800.0 0.0 0.0 0.0 0.0 0.0 0.0 1162000.0 1805400.0 1994700.0 2744200.0 1271100.0 863380.0

7. Drop Gene Duplicates and Filter Intensities by LFQ

Functions
jinspect.MQ_dropDuplicateIDs( )

The next step focuses on improving confidence in the quality of our data. This is done by applying jinspect.MQ_dropDuplicateIDs( ) which has the below effects:

  • Because one gene can have many proteins, sometimes Maxquant will create multiple proteinGroups for a single gene. As most of our analysis focuses on genes we'll trim the lowest quality proteinGroups duplicates from the table.
  • Standard LFQ defaults require a minimum of 2 peptide species, at least one of which must be unique, for quantitation to be applied. Intensity and iBAQ values, however, do not have such a minimum limit. I consider a 2 peptide minimum to be a wise filter but still have use for the Intensity and iBAQ values. Thus where the LFQ filter is applied all measurements that do not meet the minimum limit will be discarded. In short, if there isn't a companion LFQ value, there won't be an Intensity or iBAQ value either after filtering.
  • It has been documented that Match Between Runs suffers a high frequency of false peptide transfers (Lim, Paulo, Gygi 2019; doi: 10.1021/acs.jproteome.9b00492). At the protein level, however, this false transfer rate is greatly mitigated by the minimum peptide rule applied by the LFQ algorithm. This is another good reason for our filtering step.
In [11]:
#### Drop duplicates and apply LFQ filter
filter_dict = jinspect.MQ_dropDuplicateIDs(pGroup_clean, metadata, prefix = 'Peptides', ID = 'ENTREZGENE_gPro primary', pool = 'measure', drop_ID = 'None', 
                                            keep_PoolCalcs = False, applyLFQ_filter = ['Intensity', 'iBAQ'])

#### The df_keep value contains our targets, df_droprows conatins the discarded duplicates. Assign the df_keep value to a new variable and inspect.
pGroup_filtered = filter_dict['df_keep']
pGroup_filtered.head(2)
WARNING: jinspect.MQ_getMeasuredMeansByGroup() has not been checked
Out[11]:
Protein IDs Majority protein IDs Peptide counts (all) Peptide counts (razor+unique) Peptide counts (unique) Protein names Gene names Fasta headers Number of proteins Peptides ... iBAQ 254_OdT_C iBAQ 254_OdT_D iBAQ 254_OdT_E iBAQ 254_OdT_F iBAQ nCL_OdT_A iBAQ nCL_OdT_B iBAQ nCL_OdT_C iBAQ nCL_OdT_D iBAQ nCL_OdT_E iBAQ nCL_OdT_F
0 A0A087WZZ9;E5RIU6;P06493;A0A024QZP7;P06493-2;P... A0A087WZZ9;E5RIU6;P06493;A0A024QZP7;P06493-2;P... 4;4;4;4;3;2;2;2;2;2;2;1;1;1;1;1;1;1;1;1;1;1;1;... 4;4;4;4;3;2;2;2;2;2;2;1;1;1;1;1;1;1;1;1;1;1;1;... 3;3;3;3;3;1;1;1;1;1;1;1;0;0;0;0;0;0;0;0;0;0;0;... Cyclin-dependent kinase 1;Cyclin-dependent kin... CDK1;CDC2;CDK2;CDK3 ;;;;;;;;;; 46 4 ... 2246700.0 3012900.0 3454800.0 6291400.0 0.0 0.0 0.0 0.0 0.0 0.0
1 A0A0J9YXP0;A0A087X171;A0A087WXG7;A0A024R214;Q9... A0A0J9YXP0;A0A087X171;A0A087WXG7;A0A024R214;Q9... 6;6;6;6;6;6;5;5;5;3;2;1;1;1;1 6;6;6;6;6;6;5;5;5;3;2;1;1;1;1 6;6;6;6;6;6;5;5;5;3;2;1;1;1;1 Cytoplasmic polyadenylation element-binding pr... CPEB1 ;;;;;;;;; 15 6 ... 1994700.0 2744200.0 1271100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

2 rows × 149 columns

8. Review Sample Clustering by Group

Functions
jtest.getDistanceMatrix( )
jvis.MQ_showDendrogramQC_mplplot( )

A distance matrix function jtest.getDistanceMatrix( ) is provided for users who wish to apply different algorithms or create different visualisations.
I like the 'ward' method for distance calculations and using a dengrogram to confirm that clustering matches expectations and so use a prerolled function jvis.MQ_showDendrogramQC_mplplot( )

In [12]:
#### Confirm that clustering matches expectations
jvis.MQ_showDendrogramQC_mplplot(pGroup_filtered, 'LFQ intensity', metadata, 'QC clustering: ', grid = 'YES', fsize = (8, 5))

Results: GOOD

QC clustering is as expected.

9. Analyse Normalisation Effects by Sample

Functions
jwrangle.Log2_ByPrefix( )
jwrangle.MQ_poolMulti( )
jvis.ViolinCompare_sbplot( )

Here we review normalisation effects on each sample within the condition groups; these are most easily interpreted after log2 transformation. We will transform all measures of interest with _jwrangle.Log2ByPrefix( ) and then pool all the values of interest, by condition, with _jwrangle.MQpoolMulti( ). The function _jvis.ViolinComparesbplot( ) will let use compare Intensity distribution on a per sample basis.

Normalisation is applied to LFQ values by MaxQuant and is a feature of its handling of label-free data. I've not seen a detailed explanation of how it works though so it is a leap of faith that Cox and Mann have selected an appropriate method.
Normalisation must be applied separately to nCL and cCL groups. This is unusual though necessary to avoid outrageous results caused by having groups with extreme differences. See expt.313 for evidence.

In [13]:
#### Log2 transform available intensity values.
pGroup_log2 = jwrangle.Log2_ByPrefix(pGroup_filtered, 'LFQ intensity')
pGroup_log2 = jwrangle.Log2_ByPrefix(pGroup_log2, 'iBAQ')
pGroup_log2 = jwrangle.Log2_ByPrefix(pGroup_log2, 'Intensity')
pGroup_log2.replace(0,np.nan, inplace=True)

#### Create a long form dataset for each desired grouping
pool_SampleIntensity = jwrangle.MQ_poolMulti(pGroup_log2, metadata_sort, melt_list = ['Intensity', 'LFQ intensity'], group = 'condition')
pool_SampleIntensity.keys()
Out[13]:
dict_keys(['nCL', '254'])
In [14]:
#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are 
#### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violations
sns.set_style('whitegrid')
jvis.ViolinCompare_sbplot(pool_SampleIntensity['nCL'], title = 'nCL: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = ['#ff6666', '#99ccff'])
In [15]:
sns.set_style('whitegrid')
jvis.ViolinCompare_sbplot(pool_SampleIntensity['254'], title = '254nm cCL: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = cpal['Set3_qual'])

Results: GOOD

We can see here the effect of normalisation where two groups with different protein levels are processed together for MaxQuant LFQ. This approach is conventional for affinity proteomics and is intended to use quantitation as a means of identifying enriched proteins above background. In conventional affinity proteomics this practice is typically applied to experiments that seek a specific target- either through immunoprecipitation via native epitopes or by catching affinity tags. Generally, the 'beadome' is considered a consistent, and similarly quantifiable, background against which differentially enriched targets can be found.

10. Compare Intensity Distribution and Sequence Coverage

Functions
jwrangle.MQ_poolDataByCondition( )
jvis.BoxPlotByColumn_sbplot( )

Next we will compare intensity and sequence coverage between groups. Log2 transformation has already been performed so we need only use jwrangle.MQ_poolDataByCondition( ) to create the appropriate long form dataset for plotting with jvis.BoxPlotByColumn_sbplot( ).

In [16]:
#### Pool data into a single long form dataset
pooled_dfDropGroupOne = jwrangle.MQ_poolDataByCondition(pGroup_log2, metadata_sort, prefix_list = ['Intensity', 'Sequence coverage'])

#### Compare Intensity distribution using a box and whisker plot
sns.set_style('whitegrid')
jvis.BoxPlotByColumn_sbplot(pooled_dfDropGroupOne, 'Intensity: ', 'Intensity')
In [17]:
#### Compare Sequence coverage using a box and whisker plot
sns.set_style('whitegrid')
jvis.BoxPlotByColumn_sbplot(pooled_dfDropGroupOne, 'Sequence coverage: ', 'Sequence coverage %')

Result: GOOD

These results are consistent with expectations.

11. Compare Sum Peptide Counts

Functions
jinspect.MQ_getSumBySample( )
jvis.BarPlotByGroup_sbplot( )

To sum the total peptides observed across all proteins use _jinspect.MQgetSumBySample( ). These sums will be returned as a modified metadata table.
Plotting these by group is easily done with jvis.BarPlotByGroup_sbplot( ). The plotting order is determined by the metadata ordering.
In this case we are inspecting the number of peptides detected after having removed contaminants- thus if some spike-in proteins were removed, i.e. in this case RNAse treatments, they will not contribute to the peptide count. To look at the replicability of these spike-ins, we would reach back to the 'df_droprows' table generated by jinspect.MQ_dropDuplicateIDs( ) in section 7.

In [18]:
#### Extract the total peptides observed per sample
metaStats = jinspect.MQ_getSumBySample(pGroup_log2, metadata_sort, freqList = ['Peptides'], measure = False)
metaStats
Out[18]:
experiment condition replicate sample measure MQgroups Peptides
0 NC-A nCL A nCL_OdT_A Intensity RIC-OdT 3302.0
1 NC-B nCL B nCL_OdT_B Intensity RIC-OdT 3453.0
2 NC-C nCL C nCL_OdT_C Intensity RIC-OdT 5220.0
3 NC-D nCL D nCL_OdT_D Intensity RIC-OdT 5427.0
4 NC-E nCL E nCL_OdT_E Intensity RIC-OdT 4766.0
5 NC-F nCL F nCL_OdT_F Intensity RIC-OdT 4427.0
6 PC-A 254 A 254_OdT_A Intensity RIC-OdT 14021.0
7 PC-B 254 B 254_OdT_B Intensity RIC-OdT 14273.0
8 PC-C 254 C 254_OdT_C Intensity RIC-OdT 15385.0
9 PC-D 254 D 254_OdT_D Intensity RIC-OdT 15268.0
10 PC-E 254 E 254_OdT_E Intensity RIC-OdT 15515.0
11 PC-F 254 F 254_OdT_F Intensity RIC-OdT 15775.0
In [19]:
#### Plot the sum peptides
sns.set_style('whitegrid')
jvis.BarPlotByGroup_sbplot(metaStats, x_col = 'condition', y_col = 'Peptides', title = 'Sum Peptides vs Silica Capture', pal = set2_paired, errorbars = 'SEM')
<Figure size 432x288 with 0 Axes>

Results: GOOD

Clearly the cCL sample retrieves the most protein.

12. Compare Unique Gene Counts

Functions
jinspect.MQ_getFrequencyBySample( )

One gene can encode for many proteins that often share regions of similarity. As for illumina-based RNA-Seq, however, shotgun proteomics can rarely assign a peptide species to a singular protein. In MaxQuant these are called proteinGroups. Because we have do not require protein-specific results, and gene identity is more stable, our gene count describes the groups to which our detected proteins have been be assigned. Thus gene here is being detected by protein product, just as it would be detected by RNA product in RNA Seq; none of these 3 are synonymous. To be clear, this is a count and not a measure.

Gene frequency is defined by the summed observations per protein regardless of intensity value and this data is extracted to our modified metadata with jinspect.MQ_getFrequencyBySample( ) .
A typical MQ search will yield identical protein counts (though different values) for Intensity and iBAQ*. LFQ frequencies will vary depending on the search settings:

  • In this case the MQ search has set LFQ values to be calculated on a min 2 peptide ratio (this is the default)**

Notes
* Why protein counts should be identical I don't know. The original iBAQ paper stipulates rules for the inclusion of a protein in the iBAQ calculation but MaxQuant doesn't seem to apply them.
** Previously I tested LFQ min ratio at 1 peptide. At 1 minimum peptide there was unexpected QC clustering. Possible explanations for this are explained in section 7 and are cleaned up by jinspect.MQ_dropDuplicateIDs( ) function. We can expect this function to greatly reduce qualifying IDs (~20% fewer), especially in the QE samples, but I think the trade-off is worth it because we gain 1) a more robust ID check and 2) the same search can be used for LFQ based checks of dynamic changes, i.e. comparing more than one group of cCL captures for biological changes.

In [20]:
#### Count the number of unique 
metaStats = jinspect.MQ_getFrequencyBySample(pGroup_log2, metaStats, freqList = ['Intensity', 'iBAQ', 'LFQ intensity'], measure = False)
metaStats
Out[20]:
experiment condition replicate sample measure MQgroups Peptides Intensity iBAQ LFQ intensity
0 NC-A nCL A nCL_OdT_A Intensity RIC-OdT 3302.0 386 386 386
1 NC-B nCL B nCL_OdT_B Intensity RIC-OdT 3453.0 411 411 411
2 NC-C nCL C nCL_OdT_C Intensity RIC-OdT 5220.0 620 620 620
3 NC-D nCL D nCL_OdT_D Intensity RIC-OdT 5427.0 644 644 644
4 NC-E nCL E nCL_OdT_E Intensity RIC-OdT 4766.0 563 563 563
5 NC-F nCL F nCL_OdT_F Intensity RIC-OdT 4427.0 536 536 536
6 PC-A 254 A 254_OdT_A Intensity RIC-OdT 14021.0 1139 1139 1139
7 PC-B 254 B 254_OdT_B Intensity RIC-OdT 14273.0 1135 1135 1135
8 PC-C 254 C 254_OdT_C Intensity RIC-OdT 15385.0 1231 1231 1231
9 PC-D 254 D 254_OdT_D Intensity RIC-OdT 15268.0 1219 1219 1219
10 PC-E 254 E 254_OdT_E Intensity RIC-OdT 15515.0 1287 1287 1287
11 PC-F 254 F 254_OdT_F Intensity RIC-OdT 15775.0 1545 1545 1545
In [21]:
#### Plot the counts
sns.set_style('whitegrid')
jvis.BarPlotByGroup_sbplot(metaStats, x_col = 'condition', y_col = 'Intensity', title = '# Genes Detected By Group', pal = set2_paired, ylabel = 'Unique Genes', errorbars = 'SEM')
<Figure size 432x288 with 0 Axes>

Results: INTERESTING

It is clear that the nCL and cCL samples are very different both in intensity, peptide count, and unique gene count. When samples are normalised they are, as a pool, each drawn towards overall quantitive agreement as a means of flattening out technical variability. The aggressiveness of this process can have the side-effect of making true variability less apparent. This is a challenging task that is aided by the presence on some level of ratio-wise stable variation between samples because such 'background' will reduce the incidence with which true variability is improperly adjusted. Generally speaking, most normalisation algorithms can tolerate up to 30% quantitative difference- a limit that is exceeded by RIC capture. Thence, it's application is not without some problems. This topic, specifically within the MS-proteomics context, is discussed more thoroughly in my thesis.

13. Assess Replicate Correlation

Functions
jwrangle.MQ_getSliceByPrefix( )
jvis.showPearsonRegression_altair( )

The function _jwrangle.MQgetSliceByPrefix( ) provides a convenient means of extracting values of a specific group.
We can then use _jvis.showPearsonRegressionaltair( ) to perform pairwise comparisons between each member of those groups. This function is specifically applied to genes with shared intensities- genes exclusive to one sample or the other, represented by vertical or horizontal datapoints, are plotted but excluded from the pearson calculation.

In [22]:
#### Extract the intensity values as a dictionary where keys = groups
Intensity_Dict = jwrangle.MQ_getSliceByPrefix(pGroup_log2, metadata, 'Intensity', group = 'condition', add_col = None)
Intensity_Dict.keys()
Out[22]:
dict_keys(['nCL', '254'])
In [23]:
#### Check replicate consistency across all within group pairs
jvis.showPearsonRegression_altair(Intensity_Dict['nCL'], mark_color = set2_paired[2])
jvis.showPearsonRegression_altair(Intensity_Dict['254'], mark_color = set2_paired[4])

Results: GOOD

Replicates look good among both nCL and 254nm cCL samples.

14. Conventional T-Test Analysis

Functions
jinspect.MQ_getFrequencyByGroup( )
jinspect.MQ_filterValidValues( )
jvis.MQ_showImputationHistogram_mpl( )

A conventional T-Test analysis can be applied to find our significantly enriched RBPs. The function jinspect.MQ_getFrequencyByGroup( ) will allow us to tally the number of observations being made per protein group, per sample. Using this information jinspect.MQ_filterValidValues( ) then allows us to discard proteinGroups that do not meet a minimum number of real observations. This limit is user-defined, but a common practice is to include species which were observed >70% of each pool. Missing values are commonly imputed by random distribution, with the intention of achieving higher statistical power. An imputation option is included in the same function and the impute values, once extracted can be plotted with jvis.MQ_showImputationHistogram_mpl( ).

In [24]:
#### Use the metadata and proteinGroups tables to count how many times a gene is identified in its group (/6). 
#### Here I demonstrate how we can count for all instances of Intensity, iBAQ and LFQ Intensity
pGroup_Freq = jinspect.MQ_getFrequencyByGroup(pGroup_log2, metadata, 'iBAQ', group = 'condition')
pGroup_Freq = jinspect.MQ_getFrequencyByGroup(pGroup_Freq, metadata, 'LFQ intensity', group = 'condition')
pGroup_Freq = jinspect.MQ_getFrequencyByGroup(pGroup_Freq, metadata, 'Intensity', group = 'condition')

#### Filter the dataset for valid values and generate an imputed variant
RIC_preImpute = jinspect.MQ_filterValidValues(pGroup_Freq, prefix = 'LFQ intensity', group_filter = {'nCL':0, '254':4}, 
                                              sample_list = metadata['sample'], rship = 'equal or more', impute = False, 
                                              keep_cols = ['Majority protein IDs', 'ENTREZGENE_gPro primary', 'ENTREZGENE_gPro name', 'UNIPROT_gPro status'])

RIC_postImpute = jinspect.MQ_filterValidValues(pGroup_Freq, prefix = 'LFQ intensity', group_filter = {'nCL':0, '254':4}, 
                                               sample_list = metadata['sample'], rship = 'equal or more', impute = True,
                                               keep_cols = ['Majority protein IDs', 'ENTREZGENE_gPro primary', 'ENTREZGENE_gPro name', 'UNIPROT_gPro status'])

#### To review the extent and distribution of imputation in our dataframe
#### First calculate the imputed values by finding the difference between the pre and postimpute dataframes
RIC_imputed = RIC_postImpute[~RIC_postImpute.isin(RIC_preImpute)]

#### Then plot
jvis.MQ_showImputationHistogram_mpl(RIC_postImpute, RIC_imputed, prefix = 'LFQ intensity', sample_list = metadata['sample'], title = 'Imputation Distribution')

Results: GOOD

Well as good as any imputation can be. Its worked, and the bulk of the imputed values fall in the lower intensity range.

15. T-Testing for Traditional Enrichment Analysis

Function
jtest.applyIndependentTTest_ImputeCheck( )

Once a complete dataframe is generated it is ready for T-Testing. The function jtest.applyIndependentTTest_ImputeCheck( ) will perform our T-Test, apply FDR, calculate fold-changes, and assess the dependency of each p-val on imputation and create datapoints for volcano plotting. It is desirable to assess significance with respect to imputation dependency because it allows us to discriminate real values from what are essentially very loosely modeled (I would label fictional) values. Thus a number of calculations into a single output:

  • T-Test
  • Benjamini-Hochberg FDR corrected p-values (default, FDR = 1%)
  • Q-value FDR corrected p-values (default, FDR = 1%)
  • Significance assessment (default, p < 0.05, fold change min. 3x)
  • Log2Fold Change in both directions (enrichment and depletion)
  • Imputation Dependency (for BH and q-value): Reports whether a finding of significance could not have occurred unless imputed values were present.
In [185]:
#### Calculate enrichment statistics
RIC_tTest = jtest.applyIndependentTTest_ImputeCheck(RIC_preImpute, RIC_postImpute, 'nCL', '254', 'LFQ intensity', metadata)
RIC_tTest[[i for i in list(RIC_tTest.columns) if 'LFQ intensity' not in i]].head(2)

#### Save the table locally
RIC_tTest.to_csv('e299_RIC_tTest.csv', index = False)
In [171]:
#### We can use this table to follow up with a simple volcano plot
sns.set_style('darkgrid', {'axes.grid' : True})
plt.figure(figsize = (10,8))
ax = sns.scatterplot(data=RIC_tTest, legend = 'brief',
                     x="Log2FC 254", y="-Log(BH FDR p-val)",
                     hue='Impute Dependency (BH)', palette = cpal['Set2_qual'][:2], s=60, edgecolor = 'black', linewidth = 0.2)

ax = plt.axhline(y = 1.3, xmin = 0, xmax = 1, color = 'black', linestyle = '--', linewidth = 1)
ax = plt.axvline(x = 1.585, ymin = 0, ymax = 1, color = 'black', linestyle = '--', linewidth = 1 )
ax = plt.axvline(x = -1.585, ymin = 0, ymax = 1, color = 'black', linestyle = '--', linewidth = 1 )

plt.legend(loc = 'best', frameon = False)
Out[171]:
<matplotlib.legend.Legend at 0x244cc494748>
In [186]:
#### Or with a more complex, interactive volcano plot
source = RIC_tTest
source['ENTREZ name'] = source['ENTREZGENE_gPro name'].apply(lambda x: x.split(' [S')[0])
source['y'] = 1.3
source['x0'] = 1.585
source['x1'] = -1.585

# Brush for selection
brush = alt.selection(type='interval')

# Scatter Plot
points = alt.Chart(source).mark_point().encode(
    x='Log2FC 254:Q',
    y='-Log(BH FDR p-val):Q',
    color=alt.condition(brush, 'Impute Dependency (BH):N', alt.value('grey')
    ),tooltip=['ENTREZGENE_gPro primary:N', 'ENTREZ name:N']
    ).add_selection(brush).properties(width=800, height=500
    ).properties(title = 'placeholder')

c2 = alt.Chart(source).mark_rule(strokeWidth=0.5, strokeDash=[3]).encode(y='y:Q')
c3 = alt.Chart(source).mark_rule(strokeWidth=0.5, strokeDash=[3]).encode(x='x1:Q')
c4 = alt.Chart(source).mark_rule(strokeWidth=0.5, strokeDash=[3]).encode(x='x0:Q')

# Base chart for data tables
ranked_text = alt.Chart(source).mark_text().encode(
    y=alt.Y('row_number:O',axis=None)
).transform_window(
    row_number='row_number()'
).transform_filter(
    brush
).transform_window(
    rank='rank(row_number)'
).transform_filter(
    alt.datum.rank<27
)

# Data Tables
primary = ranked_text.encode(text='ENTREZGENE_gPro primary:N').properties(title='Gene')
name = ranked_text.encode(text='ENTREZ name:N').properties(title='Name')
status = ranked_text.encode(text='UNIPROT_gPro status:N').properties(title='Status')
text = alt.hconcat(primary, name, status) # Combine data tables

# Build chart
alt.hconcat(
    points + c2 + c3 + c4,
    text
).resolve_legend(
    color='independent'
)
Out[186]:

Results: GOOD

n=725 genes are identified as coding for RBP. This is consistent with n=4 previous experiments with HeLa and HEK293T cells in both 4SU 365nm crosslinking and 254nm crosslinking. The extent of imputation dependence, however, is quite high in the data.

16. Extract Significantly Enriched Hits

Given the table output from jtest.applyIndependentTTest_ImputeCheck( ) it is easy to extract the significantly enriched hits. FDR-corrected p values are provided for both the Benjamini-Hochberg method and the q-value method. The user should also select an appropriate fold change cut off; typically I like to choose a 3x cutoff which would be log2(3) = 1.58.

In [183]:
RichList = RIC_tTest[(RIC_tTest['sig (BH FDR p-val)']=='significant') & (RIC_tTest['Log2FC 254'] > 1.58)]['ENTREZGENE_gPro primary']

with open('e299_enriched_ricRBP.txt', 'w') as f:
    for item in RichList:
        f.write("%s\n" % item)
        
RichList
Out[183]:
0         CDK1
1        CPEB1
2        HDLBP
6       MRPS21
7       DCAF13
         ...  
1825     RBM8A
1829     SPCS1
1832     DDX49
1864     ACIN1
1865     SAP18
Name: ENTREZGENE_gPro primary, Length: 725, dtype: object

Conclusion

RIC can applied to single shot MS identification of RBP via label-free practices with the following characteristics:

  • 725 RBP candidate genes have been found, consistent with previous experiments and reports.
  • These were found under a 3x fold change cutoff (which is higher than the 2x previously applied by Castello)
  • These were found using an Orbitrap QE Classic, a machine 1 generation older than that used by Castello
  • Missing values are an issue: this fact is demonstrated by the number of significance calculations which depend on imputation

For a dissection of the biological characteristics of these hits see previous experiments (i.e. expt.15 and expt.14) or see the original papers by Castello et al. (2012) and Baltz et al. (2012).